Starting with Data

Data Carpentry for Social Sciences and Humanities

2024-06-20

Objectives

  • Describe what a data frame is.
  • Load external data from a .csv file into a data frame.
  • Summarize the contents of a data frame.
  • Subset values from data frames.
  • Describe the difference between a factor and a string.
  • Convert between strings and factors.
  • Reorder and rename factors.
  • Change how character strings are handled in a data frame.
  • Examine and change date formats.

Objectives

  • Describe what a data frame is.

It’s a table…

Data frames and tibbles

Data frames can contain multiple data types.

Each column in a data frame must contain only one type.

A tibble is a type of data frame used by the 📦{tidyverse} packages.
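
A minimal sketch of both points, using a small made-up data frame (the object name, column names, and values are hypothetical, not taken from the SAFI data). It uses base R's data.frame() so it runs without extra packages; a tibble follows the same one-type-per-column rule.

```r
# Hypothetical toy data: three columns, each with its own type.
households <- data.frame(
  village       = c("A", "B", "C"),      # character
  members       = c(3L, 7L, 10L),        # integer
  has_livestock = c(TRUE, FALSE, TRUE),  # logical
  stringsAsFactors = FALSE               # keep strings as character in older R
)

# One class per column
col_classes <- sapply(households, class)
col_classes

# Mixing types inside one column is not possible: c() coerces the
# values to a single common type first (here, character).
mixed <- c(1, "two", 3)
class(mixed)
```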

Table 1: The SAFI clean data.

SAFI dataset

Absolute file paths

The location of a file on your computer.

A file in your home directory may be

C:\Users\bagginsf\there-and-back-again.docx

or

~/there-and-back-again.docx

depending on your OS. These are absolute paths.
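
As a small illustration, R can show the absolute path behind the ~ shorthand. This is only a sketch: the file name is the example from above and does not need to exist, and the result depends on your OS and user account.

```r
# "~" is shorthand for the home directory; path.expand() spells it out.
full_path <- path.expand("~/there-and-back-again.docx")
full_path

# The working directory is also reported as an absolute path.
getwd()
```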

Relative file paths

A cartoon showing two paths side-by-side. On the left is a scary, spooky forest, with spiderwebs and gnarled trees, with file paths written on the branches like “~/mmm/nope.csv” and “setwd("/haha/good/luck/")”, and a scared-looking cute fuzzy monster running out of it. On the right is a bright, colorful path with flowers, a rainbow, and sunshine, with signs saying “here!” and “it’s all right here!” A monster wearing a backpack and holding a walking stick, facing away from us, is looking toward the right path. Stylized text reads “here: find your path.”

Artwork by @allison_horst

With the 📦{here} package combined with an RStudio project we can use relative paths.

This means we start the file path at the root of the project, and don’t have to worry about anything that comes before.

So if our data file is stored in a folder called data, the file path will be ./data/mydata.csv

where . is the project root.
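
A hedged sketch of how such a path can be built in R. The file data/mydata.csv is the hypothetical example from above and is not read here. The call to here::here() is guarded so the sketch also runs when the package is not installed.

```r
# Base R: build the relative path from its parts.
# It is resolved against the working directory, which an RStudio
# project sets to the project root when the project is opened.
rel_path <- file.path("data", "mydata.csv")
rel_path

# With {here}: the same file, located from the project root.
if (requireNamespace("here", quietly = TRUE)) {
  here_path <- here::here("data", "mydata.csv")
  print(here_path)
}
```

Building the path from its parts avoids typing the separator by hand, so the same script works on Windows, macOS, and Linux.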

Important

Relative paths are useful when collaborating on projects. Imagine sharing a script that loads data from C:/This/path/only/exists/for/me.csv

Exercise 1

  1. Create a tibble (interviews_100) containing only the data in row 100 of the interviews dataset.

Now, continue using interviews for each of the following activities:

  2. Notice how nrow() gave you the number of rows in the tibble?
  • Use that number to pull out just that last row in the tibble.
  • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
  • Pull out that last row using nrow() instead of the row number.
  • Create a new tibble (interviews_last) from that last row.
  3. Using the number of rows in the interviews dataset that you found in question 2, extract the row that is in the middle of the dataset. Store the content of this middle row in an object named interviews_middle. (hint: This dataset has an odd number of rows. Use the median() function and what you’ve learned about sequences in R to extract the middle row!)
  4. Combine nrow() with the - notation to reproduce the behavior of head(interviews), keeping just the first through 6th rows of the interviews dataset.

Solution

## 1.
interviews_100 <- interviews[100, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(interviews)
interviews_last <- interviews[n_rows, ]
## 3.
interviews_middle <- interviews[median(1:n_rows), ]
## 4.
interviews_head <- interviews[-(7:n_rows), ]
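
The solution needs the interviews tibble to be loaded. To try the same indexing without it, here is a minimal sketch on a made-up data frame (toy and its columns are hypothetical stand-ins, with an odd number of rows like the real dataset).

```r
# Hypothetical stand-in for `interviews`: 11 rows (an odd number).
toy <- data.frame(id = 1:11, value = letters[1:11])

n_rows <- nrow(toy)                     # 11
toy_last   <- toy[n_rows, ]             # last row
toy_middle <- toy[median(1:n_rows), ]   # row 6, the middle row
toy_head   <- toy[-(7:n_rows), ]        # drop rows 7 to 11, keep 1 to 6

# The negative index keeps the same rows as head()
all.equal(toy_head, head(toy))
```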

Exercise 2

  • Rename the levels of the memb_assoc factor so that each starts with an uppercase letter: “No”, “Undetermined”, and “Yes”.

  • Now that we have renamed the factor level to “Undetermined”, can you recreate the barplot such that “Undetermined” is last (after “Yes”)?

Solution

memb_assoc <- interviews$memb_assoc
## Rename levels (fct_recode() is from {forcats}, loaded with the tidyverse)
memb_assoc <- fct_recode(memb_assoc, No = "no",
                         Undetermined = "undetermined", Yes = "yes")
## Reorder levels. Note we need to use the new level names.
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
plot(memb_assoc)
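
For comparison, the same renaming and reordering can be sketched in base R, without fct_recode(). This is an alternative to the solution, not a replacement for it, and the responses below are made up for illustration.

```r
# Hypothetical responses standing in for interviews$memb_assoc
memb_assoc <- factor(c("yes", "no", "undetermined", "no", "yes", "no"))
levels(memb_assoc)   # alphabetical: "no" "undetermined" "yes"

# Rename: the new names are matched to the existing levels by position
levels(memb_assoc) <- c("No", "Undetermined", "Yes")

# Reorder so that "Undetermined" comes last
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
level_counts <- table(memb_assoc)
level_counts
```

Renaming by position is shorter but easier to get wrong than fct_recode(), which pairs each new name with its old one explicitly.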

Questions

See if you can answer these questions

  • What is a data.frame?
  • How can I read a complete csv file into R?
  • How can I get basic summary information about my dataset?
  • How can I change the way R treats strings in my dataset?
  • Why would I want strings to be treated differently?
  • How are dates represented in R and how can I change the format?